Transfer manager checksums download #3629
Conversation
actualChecksum = getResult.GetChecksumCRC64NVME();
}

if (!actualChecksum.empty() && actualChecksum != handle->GetChecksum()) {
Do we need to check for !actualChecksum.empty()? Does it have to be non-empty or a simple mismatch is enough for us to error out?
You're right, !actualChecksum.empty() is redundant here.
} else if (!getResult.GetChecksumCRC64NVME().empty()) {
  partState->SetChecksum(getResult.GetChecksumCRC64NVME());
}
This logic of retrieving a checksum from an S3Outcome is repeated a few times here. Could be worth trying something like this (maybe with SFINAE) in the header file so we can centralize this logic:
template<typename OutcomeType>
Aws::String ExtractChecksum(const OutcomeType& outcome) {
const auto& result = outcome.GetResult();
if (!result.GetChecksumCRC32().empty()) {
return result.GetChecksumCRC32();
}
if (!result.GetChecksumCRC32C().empty()) {
return result.GetChecksumCRC32C();
}
if (!result.GetChecksumSHA256().empty()) {
return result.GetChecksumSHA256();
}
if (!result.GetChecksumSHA1().empty()) {
return result.GetChecksumSHA1();
}
if (!result.GetChecksumCRC64NVME().empty()) {
return result.GetChecksumCRC64NVME();
}
return {};
}
Then we can just call it:
auto actualChecksum = ExtractChecksum(getObjectOutcome);
partState->SetChecksum(ExtractChecksum(outcome));
Agreed - this refactoring makes sense to eliminate the duplication.
for (const auto& part : completedParts) {
  sortedParts.emplace_back(part.first, part.second->GetChecksum());
}
std::sort(sortedParts.begin(), sortedParts.end());
I could be COMPLETELY wrong here, but is the sorting needed, since XOR is associative? I might be missing some S3-specific context.
}
}

bool TransferManager::ValidateDownloadChecksum(const std::shared_ptr<TransferHandle>& handle) {
Whenever we return a bool from a function, we usually want an Aws::Utils::Outcome instead. Rather than simply returning a true/false failure, we should return a proper error describing what happened.
hashCalculator = Aws::MakeShared<Aws::Utils::Crypto::CRC64>("TransferManager");
}

auto fileStream = Aws::MakeShared<Aws::FStream>("TransferManager",
So we are opening the file, re-reading the entire stream, and then failing the download if the content differs. This gives me two concerns:
- A file with corrupted data is written to disk. The spec says:
Failure handling: The S3 Transfer Manager SHOULD download to a file in the same directory with .s3tmp.{uniqueIdentifier} (uniqueIdentifier <= 8 chars) suffix to indicate the file is in the process of being downloaded, but not yet complete. After all content has been written, rename the temporary file to the destination file using atomic rename operation. If there is an error, delete the temporary file. The S3 Transfer Manager SHOULD register cleanup handlers to remove the temporary file if the process terminates unexpectedly, if supported. The S3 Transfer Manager MUST NOT pre-validate available space because it can introduce race conditions and false failures.
This concerns me because we could leave behind a file that failed data integrity.
- We are reading the stream twice. On massive files (think 5 TB), this introduces real overhead. I think this is covered in the spec with:
If full object checksum is calculated inline while downloading source, the S3 Transfer Manager MAY buffer full parts in-memory while flushing is blocked on updating the running checksum object.
This really means that while we write out to the file we should also be updating the checksum.
Description of changes: Adding checksum validation on S3 download
Check all that apply:
Check which platforms you have built SDK on to verify the correctness of this PR.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.